-
-
Notifications
You must be signed in to change notification settings - Fork 707
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor JinjaTracer: Split into two classes, break up _slice_template()
function
#2870
Refactor JinjaTracer: Split into two classes, break up _slice_template()
function
#2870
Conversation
Codecov Report
@@ Coverage Diff @@
## main #2870 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 163 163
Lines 12600 12632 +32
=========================================
+ Hits 12600 12632 +32
Continue to review full report at Codecov.
|
@@ -22,6 +22,7 @@ types-chardet | |||
types-appdirs | |||
types-colorama | |||
types-pyyaml | |||
types-Jinja2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unrelated to the rest of the PR, but I noticed we were missing this
return RawSliceInfo(unique_alternate_id, alternate_code, []) | ||
return self.make_raw_slice_info(unique_alternate_id, alternate_code, []) | ||
|
||
def update_inside_set_or_macro(self, block_type, trimmed_parts): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extracted this function from update_inside_set_or_macro()
.
# Exiting a set/macro block. | ||
self.inside_set_or_macro = False | ||
|
||
def make_raw_slice_info(self, *args, **kwargs) -> RawSliceInfo: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created this helper function to centralize/reduce the number of places we check inside_set_or_macro
. This reduces the likelihood of future bugs, where we forget to check this flag.
@@ -215,9 +247,9 @@ def _slice_template(self) -> List[RawFileSlice]: | |||
# https://jinja.palletsprojects.com/en/2.11.x/api/#jinja2.Environment.lex | |||
stack = [] | |||
result = [] | |||
set_idx = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The set_idx
variable was moved and renamed as a new instance field on JinjaTracer
: inside_set_or_macro
.
Tagging @OTooleMichael for review. I don't think I can explicitly add him as a reviewer yet (probably a GitHub group needs updating). |
@alanmcruickshank sent you an invite couple of days back. Check your email (and spam folder) @OTooleMichael |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM:
I read the code linearly, commented as I went and asked questions.
re_open_tag = regex.compile(r"^\s*({[{%])[\+\-]?\s*") | ||
re_close_tag = regex.compile(r"\s*[\+\-]?([}%]})\s*$") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are these not full Regexes - ie starting with ^ and ending with $?
Also wondering if gathering the whitespace before a tag has influence on indents (I note later on that search is used vs the lexed array) perhaps some of these matches are forced by what the lexer returns?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The templated block is a whole string and not the individual components.
So {% if ... %}
is all one templated string.
So we want ^{%
for opening tag (simplified), and %}$
for closing tag (simplified) of that string, as @barrywhart has.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the past, we received one token at a time from Jinja, then combined them into a single string (str_buff
) once they became a full block, e.g. {% if a == 3 %}
. Now we still build str_buff
, but we also keep the parts separate in an array str_parts
. The regexes remain the same for legacy reasons. I think it would work either way.
Mostly we want what the lexer returns, as the purpose of this class is to deduce exactly what Jinja is going to return for the "real" template, including tricky whitespace handling. That's been working pretty well for a while. If you look at handle_left_whitespace_stripping()
and the code section (not function beginning with # Right whitespace was stripped.
, you'll see how some of the trickier bits work.
There is one case where Jinja returns the whole block at a time, and that's a {% raw %} ... {% endraw %}
section.
m_strip_right = regex.search( | ||
r"\s+$", raw, regex.MULTILINE | regex.DOTALL | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happening here ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure you need the DOTALL
here (there is no .
)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're trying to determine if a block uses right whitespace stripping, e.g.
SELECT
{{ 'col1,' -}}
col2
Note the -
. This tells Jinja to discard any whitespace between the }}
and the next bit of text. We need to account for this because it affects the template output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which goes to my next question further down.
Also still not sure we need the DOTALL
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't remember why these flags were added. Maybe they were necessary back when we only had the whole block string, rather than the parts? Note that in dbt, it's common to do things like:
DOTALL
actually means "allow the .
pattern character to match newlines, and
MULTILINE` means "allow multiline matches". https://www.codespeedy.com/re-dotall-in-python/
I'm hesitant to change this now. 😅 If the string were really long, I could see it slowing things down a bit, but this string is literally just one Jinja token -- should be very short.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair point. It's not changed in this PR and don't think it'll matter too much one way of another. So fine leaving it.
# Handle a tag received in one go. | ||
trimmed_content = str_buff[len(m_open.group(0)) : -len(m_close.group(0))] | ||
trimmed_parts = trimmed_content.split() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there need for the If and the str_parts array at all, can the Else just be applied to both scenarios? (I guess its not expensive enough to matter about the splitting computation)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you asking if we don't need to keep the individual lex
results as separate strings?
The reason I did that is that Jinja doesn't just split on whitespace, so by keeping things as Jinja returns them, the code is more robust. Here's a test case for why:
{% set col= "col1" %}
SELECT {{ col }}
Note there's no space after col
. Before switching to str_parts
, we needed special handling since we might see any of:
{% set col= "col1" %}
{% set col = "col1" %}
{% set col ="col1" %}
This resulted in ugly, fragile code. Now it "just works"™️.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add the above as a code comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just meant why can the algo between lines 438 and 442 just be applied to both the if and the else.
the fn declaration says str_buff is non optional.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
str_buff is required, but we prefer to use str_parts because it makes better use of Jinja lexing.
(Not sure I'm fully understanding, though.)
If you feel like tinkering with it, the tests in test/core/templaters/jinja_test.py
are pretty thorough.
""" | ||
# Find the token returned. Did lex() skip over any characters? | ||
num_chars_skipped = self.raw_str.index(token, self.idx_raw) - self.idx_raw | ||
if num_chars_skipped: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally Id prefer a guard if not num_chars_skipped: \ return
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll do that. Previously, I had just mechanically pasted this code after pulling it out of a larger function. 👍
# to behave similarly as the left stripping case. Note that | ||
# the stakes are a bit lower here, because lex() hasn't | ||
# *omitted* any characters from the strings it returns, | ||
# it has simply grouped them differently than we want. | ||
trailing_chars = len(m_strip_right.group(0)) | ||
self.raw_sliced.append( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps just throw this into a fn for ease of reading its nice to balance
handle_left_whitespace_stripping
even if this one is easier
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked at doing that a few different ways. It's definitely doable, but the function ended up taking something like 5 parameters, and I couldn't find a reasonable way to get that down. (I tried to limit the new functions to 3 or fewer parameters.) The logic here is not complex, but it depends on a lot of state, so it seemed like maybe it wasn't improving the code. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
_slice_template()
function
Brief summary of the change made
Several recent fixes to
JinjaTracer
have demonstrated that it's a large class that's tricky to maintain. Two major concerns:JinjaTracer._slice_template()
function is 275 lines long with lots of intricateif
conditions, it's hard to read and easy to break. This PR splits it up into a number of smaller functions, reducing it to about 100 lines.JinjaTracer
performs two related but fairly distinct tasks. Combining these in one class makes it hard to understand because it's unclear at times which fields are used for one task and which for both.In creating this PR, I was pretty careful to avoid making behavior changes to the code. This code has pretty good automated testing now, as well as internal sanity checking of the
JinjaTracer
output. So, in reviewing this PR, I suggest focusing more on coding style and readability than correctness or "what changed from before" (because the text changes are massive due to moving things around).Are there any other side effects of this change that we should be aware of?
Pull Request checklist
Please confirm you have completed any of the necessary steps below.
Included test cases to demonstrate any code changes, which may be one or more of the following:
.yml
rule test cases intest/fixtures/rules/std_rule_cases
..sql
/.yml
parser test cases intest/fixtures/dialects
(note YML files can be auto generated withtox -e generate-fixture-yml
).test/fixtures/linter/autofix
.Added appropriate documentation for the change.
Created GitHub issues for any relevant followup/future enhancements if appropriate.